Application of a deep learning system in glaucoma screening and further classification with colour fundus photographs: a case control study

Background To verify the efficacy of automatic screening and classification of glaucoma with a deep learning system (DLS).
Methods A cross-sectional, retrospective study in a tertiary referral hospital. Patients with healthy optic discs, high-tension glaucoma, or normal-tension glaucoma were enrolled. Complicated non-glaucomatous optic neuropathy was excluded. Colour and red-free fundus images were collected to develop the DLS and to compare their efficacy. A convolutional neural network based on the pre-trained EfficientNet-b0 model was used. Glaucoma screening (binary classification) and ternary classification, with or without additional demographics (age, gender, high myopia), were evaluated, and confusion matrices and heatmaps were then generated. Area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, specificity, and F1 score were the main outcome measures.
Results Two hundred and twenty-two cases (421 eyes) were enrolled, with 1851 images in total (1207 with normal and 644 with glaucomatous discs). The train set and test set comprised 1539 and 312 images, respectively. Without demographics, the AUC, accuracy, precision, sensitivity, F1 score, and specificity of the DLS in eye-based glaucoma screening were 0.98, 0.91, 0.86, 0.86, 0.86, and 0.94 in the test set, respectively. The same outcome measures in eye-based ternary classification without demographic data were 0.94, 0.87, 0.87, 0.87, 0.87, and 0.94 in the test set, respectively. Adding demographics had no significant impact on efficacy, but establishing a linkage between eyes and images improved performance. Confusion matrices and heatmaps suggested that retinal lesions and photograph quality could affect classification. Colour fundus images played a larger role than red-free fundus images in glaucoma classification.
Conclusions The DLS showed promising results, with high AUC and specificity, in distinguishing normal optic nerves from glaucomatous ones in fundus images and in further classifying glaucoma.
Supplementary Information The online version contains supplementary material available at 10.1186/s12886-022-02730-2.

(GON), corresponding retinal nerve fibre layer (RNFL) and visual field (VF) defects [2]. Since early symptoms could be insidious, effective glaucoma screening is important for early diagnosis, especially in health professional shortage areas.
Heidelberg retinal tomography (HRT), optical coherence tomography (OCT), VF tests, and colour fundus photography with red-free imaging are pivotal tools for glaucoma diagnosis [3]. Although HRT and OCT can detect changes of the optic disc and surrounding RNFL, image quality and the availability of facilities limit their wide application. By contrast, fundus imaging is easy to set up, less technique-dependent, and already widely used, which makes it a reasonable candidate for glaucoma screening. With red-free imaging, disc haemorrhage (DH) and wedge-shaped RNFL defects can be easily detected as clues to glaucoma; however, its value in DLS-facilitated glaucoma screening and classification remains to be explored.
Artificial intelligence (A.I.) with a deep learning system (DLS) has been widely explored in ophthalmology for screening diabetic retinopathy (DR), macular degeneration, papilledema, and glaucoma [4][5][6][7]. Compared with the commercialised products for detecting DR, DLSs for glaucoma screening and classification are still under development. DLSs using OCT scans of RNFL thickness, alone or combined with fundus images, have shown varying efficacy in diagnosing glaucoma and predicting progression, with an area under the receiver operating characteristic curve (AUC) from 83 to 96% [8][9][10]. When glaucoma was detected from fundus images of referred diabetic patients, an AUC of 94.2%, sensitivity of 96.4%, and specificity of 87.2% were found [11]. A fundus imaging-based DLS achieved an AUC, sensitivity, and specificity of 98.6%, 95.6%, and 92.0%, respectively, in detecting GON [7]. When different image-cropping ratios on the optic nerve head (ONH) or peripheral images were used in a DLS, the results revealed that information from both the ONH and the surrounding retina contributed to glaucoma diagnosis [12]. With a pre-trained algorithm, even fundus photographs taken with smartphones can be used to screen for glaucoma, with better performance in advanced stages [13].
Besides glaucoma screening, machine learning has also been used to assess glaucoma progression in a myopic cohort with normal-tension glaucoma (NTG) [14]. Unlike high-tension glaucoma (HTG), NTG may be overlooked because intraocular pressure (IOP) is normal and because ocular examinations and a systemic survey are required to exclude other optic neuropathies before diagnosis. In published articles, a DLS reached an AUC of 0.966 in detecting structural changes with OCT-based parameters between glaucoma suspects and early NTG patients [15]. Although DH in fundus imaging is a common presentation of NTG, whether other phenotypes exist in fundus images that distinguish NTG from other types of glaucoma has not been fully explored. Since algorithms, enrolled parameters, and results differ between DLSs, we aimed to develop a DLS for glaucoma screening and classification in this study.

Patient
The study was approved by the Institutional Review Board of Chang Gung Memorial Hospital, Linkou (No.201801801B0C601) and adhered to the tenets of the Declaration of Helsinki. Informed consent was waived for all patients, and all images were anonymised before training and testing. The diagnosis and enrolment of glaucoma patients were based on Anderson's VF criteria. In brief, a vertically enlarged cup, an RNFL defect in colour or red-free fundus images or OCT, and a glaucomatous VF defect were documented to confirm the diagnosis of glaucoma. At least two consistent glaucomatous VF defects were recorded as baseline data before diagnosis, except in end-stage glaucoma patients with prominent clinical presentation and imaging findings, such as total cupping, pale disc, elevated IOP, and tunnel vision. Patients with HTG, NTG, and non-GON were enrolled. Among GON patients, those with IOP equal to or higher than 22 mmHg were diagnosed as HTG. Treatment-naïve patients with long-term IOP equal to or lower than 21 mmHg were regarded as NTG. Pre-perimetric glaucoma and glaucoma suspects were not enrolled.
Fundus images were taken with fundus cameras (Carl Zeiss VISUCAM 524, Canon CR-2AF, and KOWA nonmyd 8 s). The colour fundus photographs and red-free fundus images were taken in two ways: optic nerve head-centred and papillo-macular area-centred. Although the three cameras had different resolutions, all enrolled images were resized to the same resolution before analysis. Demographics, including age, gender, high myopia, and diagnosis, were collected. High myopia was defined as a spherical equivalent equal to or less than -6 D or an axial length longer than 26 mm. All fundus images were assigned to either the train set or the test set. First, we assigned images to the train or test set at the patient level, so that the same patient never appeared in both sets. Then, we further divided the train set into a training set and a validation set at the eye level, so that all images from the same eye fell entirely within either the training set or the validation set.
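The two-stage partition described above can be sketched as follows. This is a minimal illustration only: the record fields (`image`, `patient`, `eye`), the helper name `split_by_group`, the 20% hold-out fractions, and the toy data are assumptions made for the example and are not taken from the study.

```python
import random
from collections import defaultdict

def split_by_group(records, key, holdout_frac, seed=0):
    """Split records so that all records sharing `key` land on the same side."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    ids = sorted(groups)                      # deterministic order before shuffling
    random.Random(seed).shuffle(ids)
    n_holdout = max(1, round(holdout_frac * len(ids)))
    held = [r for g in ids[:n_holdout] for r in groups[g]]
    kept = [r for g in ids[n_holdout:] for r in groups[g]]
    return kept, held

# Hypothetical toy data: 20 patients, 2 eyes each, 3 images per eye.
images = [
    {"image": f"p{p}_{e}_{i}.png", "patient": p, "eye": (p, e)}
    for p in range(20) for e in ("OD", "OS") for i in range(3)
]

# Stage 1: patient-level split into train set and test set.
train_val, test = split_by_group(images, "patient", holdout_frac=0.2)
# Stage 2: eye-level split of the train set into training and validation sets.
train, val = split_by_group(train_val, "eye", holdout_frac=0.2)
```

Splitting on a grouping key rather than on individual images is what prevents images of the same patient (or eye) from leaking across sets.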
We trained the DLS by using AutoDL API (Application Programming Interface), which is the API of MAIA software (Medical Artificial Intelligence Aggregator) (Muen Biomedical and Optoelectronic Technologist, Inc, Taipei city, Taiwan). We applied a convolutional neural network (CNN) with the model structure of EfficientNet-b0 pre-trained on ImageNet [16,17]. All fundus images were resized to 256 × 256 pixels. Data augmentation and dropout layers were applied to prevent overfitting [18]. The feature maps extracted by the CNN were then flattened, concatenated with the demographic features, and fed into the fully connected layers. The models were trained for 100 epochs with a batch size of 32. The loss function was cross-entropy loss, and the optimizer was Adam [19]. During training, the learning rate followed a one-cycle cosine annealing schedule [20,21]. Five-fold cross validation was performed to validate the models. Among the five models from five-fold cross validation, the one with the highest F1 score was chosen for model testing. In binary classification, images were classified as GON or non-GON by the DLS with or without demographics. Similarly, in ternary classification, non-GON, HTG, and NTG were classified with or without demographic data. Confusion matrices and heatmaps were created after analysis.
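The exact form of the learning-rate schedule is not specified beyond a one-cycle cosine annealing strategy. A minimal sketch of one common shape (a cosine ramp up to a peak followed by a cosine decay) is given here; the function name and all numeric defaults (`max_lr`, `warmup_frac`, `start_div`, `end_div`) are illustrative assumptions, not the settings used in the study.

```python
import math

def _cos_interp(start, end, frac):
    """Cosine interpolation: returns `start` at frac=0 and `end` at frac=1."""
    return end + (start - end) * (1 + math.cos(math.pi * frac)) / 2

def one_cycle_cosine_lr(step, total_steps, max_lr=1e-3, warmup_frac=0.3,
                        start_div=25.0, end_div=1e4):
    """Learning rate at 0-based `step` of a single warm-up/decay cycle.

    All default values are assumptions chosen for illustration.
    """
    if total_steps < 2 or not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps, total_steps >= 2")
    peak_step = max(1, int(warmup_frac * (total_steps - 1)))
    start_lr = max_lr / start_div
    end_lr = max_lr / end_div
    if step <= peak_step:
        # Phase 1: rise from start_lr to max_lr.
        return _cos_interp(start_lr, max_lr, step / peak_step)
    # Phase 2: decay from max_lr to end_lr.
    return _cos_interp(max_lr, end_lr,
                       (step - peak_step) / (total_steps - 1 - peak_step))

# One value per epoch for a 100-epoch run (updating once per epoch is an assumption).
schedule = [one_cycle_cosine_lr(s, 100) for s in range(100)]
```

In practice such schedules are often stepped per batch rather than per epoch; the sketch leaves that choice to `total_steps`.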
AUC, accuracy, precision, sensitivity, specificity, and F1 score were used as outcome measures. Precision (positive predictive value) was defined as the fraction of true glaucoma among all images classified as glaucoma. The F1 score, the harmonic mean of precision and sensitivity, was selected to evaluate the performance of model prediction. SPSS statistics software was used to calculate p values and other statistics. A p value < 0.05 was considered statistically significant. The independent t test and the Chi-squared test were used to compare data in binary classification. One-way analysis of variance (ANOVA) with Tukey's honestly significant difference (HSD) test and the Chi-squared test were used to compare data in ternary classification and between combinations of demographic data.
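For reference, the threshold-based outcome measures can be computed from the four cells of a binary confusion matrix as sketched here. The function name is hypothetical and the counts in the usage line are toy values, not the study's data.

```python
def binary_metrics(tp, fp, fn, tn):
    """Outcome measures from confusion-matrix counts (glaucoma = positive class)."""
    def ratio(num, den):
        return num / den if den else 0.0   # avoid division by zero on empty cells

    precision = ratio(tp, tp + fp)          # positive predictive value
    sensitivity = ratio(tp, tp + fn)        # recall
    specificity = ratio(tn, tn + fp)
    accuracy = ratio(tp + tn, tp + fp + fn + tn)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }

# Toy counts for illustration only.
metrics = binary_metrics(tp=40, fp=10, fn=10, tn=140)
```

AUC is not included because it depends on the full ranking of predicted probabilities rather than on a single confusion matrix.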

Results
Two hundred and twenty-two cases (421 eyes) were enrolled, half male and half female, with 1851 raw images in total. Among the 421 eyes, 290 had healthy optic nerves and the remaining 131 had GON, of which 85 were HTG and 46 had long-term normal IOP.
In the binary classification, 1207 raw images of the optic disc were non-GON and 644 were GON. In the ternary classification, the 644 GON images were further classified into 235 NTG images and 409 HTG images. Of the 1851 images in the dataset, 1231 (283 eyes) were used as the training set and 308 (68 eyes) were assigned to the validation set. The remaining 312 images (70 eyes) formed the test set. The mean ages of healthy and GON patients in the binary classification were 48.33 ± 18.54 and 61.22 ± 16.79 years, respectively, a significant difference (p < 0.001). The Chi-squared test showed no difference in gender between the glaucoma and control groups (p = 0.49). In the ternary classification, the mean ages of non-GON, NTG, and HTG patients were 48.33 ± 18.54, 60.1 ± 17.85, and 61.87 ± 16.28 years, respectively. ANOVA gave a p value < 0.001, indicating a significant difference in age distribution among the three groups. Demographic data are shown in Table 1.
Our model was verified in two ways: image-based and eye-based analysis. In the former, each image was treated as an independent data point; in the latter, images from the same eye were annotated beforehand so that they could be linked in the later analysis. The results of the different analyses are presented in Tables 2 and 3. Five-fold cross validation was performed, with no significant difference in results. In brief, precision, accuracy, sensitivity, specificity, F1 score, and AUC in image-based glaucoma screening were 0.92, 0.79, 0.43, 0.98, 0.59, and 0.85 in the test set. After the linkage between each image and its eye was provided, an eye was classified as glaucoma if any of its images was predicted as positive. In this eye-based analysis, precision, accuracy, sensitivity, specificity, F1 score, and AUC were 0.86, 0.91, 0.86, 0.94, 0.86, and 0.98 in the test set; accuracy, sensitivity, F1 score, and AUC improved substantially, while precision and specificity decreased slightly. The receiver operating characteristic (ROC) curves in binary classification with or without demographic information in the test set are shown in Fig. 1 (a and b). Confusion matrices for image- and eye-based binary classification in the test set are shown in Fig. 2 (a and b). Confusion matrices of binary classification after adding extra information are shown in Fig. 2 (c and d). When information about age, gender, and high myopia was added to our model, no improvement was observed in the outcome measures in either the validation or the test set in binary classification (Tables 2 and 3). When the outcome measures were compared between red-free and colour fundus images, red-free imaging showed higher efficacy in most parameters in glaucoma screening, but the difference did not reach statistical significance (Tables 4 and 5, Fig. 3a).
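The eye-based rule used for screening, in which an eye is called glaucomatous if any of its images is predicted positive, can be sketched as follows. The function name, eye identifiers, and predictions are hypothetical.

```python
from collections import defaultdict

def eye_level_any_positive(image_preds):
    """Aggregate image-level binary predictions to eye level.

    image_preds: iterable of (eye_id, label) pairs, label 1 = GON, 0 = non-GON.
    An eye is labelled 1 if at least one of its images is labelled 1.
    """
    eye_pred = defaultdict(int)
    for eye_id, label in image_preds:
        eye_pred[eye_id] = max(eye_pred[eye_id], int(label))
    return dict(eye_pred)

# Hypothetical image-level predictions for two eyes.
preds = [("A-OD", 0), ("A-OD", 1), ("A-OD", 0), ("B-OS", 0), ("B-OS", 0)]
eye_labels = eye_level_any_positive(preds)
```

Because a single positive image is enough, this rule can only add positives at eye level, which is consistent with sensitivity rising while precision and specificity fall slightly.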
In the heatmaps of binary classification, a weighted area was found outside the non-GON optic disc in all four quadrants (Fig. 4 a and b). In the heatmaps of GON, a weighted area was shown temporal to the optic disc (Fig. 4 c to h).
To verify the DLS in ternary classification, the validation set and test set were analyzed with or without demographics. The results for the different sets are presented in Tables 2 and 3. To provide eye-based predictions without demographics, we averaged the predicted probabilities of the images of each eye. In this eye-based analysis of the test set in ternary classification, all outcome metrics improved, achieving an accuracy of 0.87, F1 score(macro) of 0.77, and AUC(macro) of 0.9. The ROC curves in ternary classification with or without demographics in the test set are shown in Fig. 1 (c and d). Confusion matrices of ternary classification without demographics in the test set are shown in Fig. 2 (e and f). The distribution of our ternary classification results after adding clinical information is shown in Fig. 2 g and h. No remarkable increase in any outcome measure was noted after adding extra information to the image- and eye-based analyses in ternary classification (Tables 2 and 3). When the outcome measures of red-free and colour fundus images were compared, colour fundus images performed better in ternary classification, with a statistically significant difference (Tables 4 and 5, Fig. 3b). The results of ternary classification were also visualized in heatmaps, in which the weighted area was mainly supero-temporal to the normal disc (Fig. 5 a and b). Heatmaps of eyes with HTG showed a weighted area over the disc (Fig. 5 c and d). However, heatmaps of NTG presented a weighted area superior to the disc (

Discussion
In this study, an image-based or eye-based DLS was developed to perform glaucoma screening. Moreover, an algorithm was developed for ternary classification of non-GON, HTG, and NTG patients. Although we enrolled only 222 patients (421 eyes) with 1851 images, in the image-based analysis of binary classification the AUC reached 0.85 in the test set with the assistance of dropout and data augmentation. In the eye-based analysis, accuracy improved from 0.79 to 0.91 and the F1 score reached 0.86. In ternary classification, F1 score(macro) achieved 0.77, and AUC reached 0.9 in  [22][23][24], no remarkable improvement of performance has been found in our binary and ternary classification when providing demographics. In clinical settings, these factors are used to evaluate glaucoma suspects; however, it seems that an image-only DLS is capable of screening and classification without additional information. Furthermore, the impacts of age, gender, and myopia on the eye are thought to arise from age-related oxidative stress on the trabecular meshwork, structural changes at the anterior chamber angle, and circulatory changes around the optic nerve head. Other complicated influences of high myopia, such as peripapillary atrophy, retinal thinning, and tilted optic disc, also potentially play a role in glaucoma development. These molecular and structural changes may leave no discriminative clues in fundus images, which would explain their limited impact on our results. Consequently, a simple fundus image-based screening system without demographics can be applied in telemedicine for fast screening.
Images of the optic disc, OCT, VF, and clinical demographics have all been used to verify the efficacy of glaucoma diagnosis with different algorithms in published studies. Li et al. evaluated the efficacy of a DLS in detecting referable GON based on 70,000 colour fundus images alone from an online dataset, presenting an AUC of 98.6%, sensitivity of 95.6%, and specificity of 92.0% [7]. By comparison, our glaucoma screening achieved an AUC of 98.0%, sensitivity of 86.0%, and specificity of 94.0% with far fewer images. Different methods of image extraction have also been integrated into fundus image-based DLSs, such as wavelet features [25], features of the ONH [26], and adaptive threshold-based image processing [27], in which the optic disc and RNFL were specifically segmented and extracted for analysis. However, misalignment and misclassification tend to develop when segmentation and localization fail to be synchronized. Since informative data exist in both the optic nerve and the retina in glaucoma screening [9], we enrolled whole fundus images, including macula-centred, optic nerve head-centred, and red-free images, to avoid over-manipulating the data.
The advantage of our method is that it keeps most of the information within fundus images, explores the ability of DLSs, simulates the real-world clinical situation, and can be applied in daily practice. The disadvantage of analyzing whole fundus pictures is the possible noise from retinal or optic disc lesions and from image artifacts. When the performance of the DLS in binary and ternary classification was compared between red-free and colour fundus images, red-free imaging seemed helpful in glaucoma screening, but the difference was not statistically significant. However, colour fundus images showed better and statistically significant performance in ternary classification. The sharper signal along RNFL defects in red-free imaging, compared with colour fundus images, may explain its remarkable outcome measures in glaucoma screening and in clinical practice. However, RNFL defects may be indistinguishable between HTG and NTG; therefore, colour images with more digital information are favoured in
Fig. 1 The ROC curves with AUC in binary and ternary classification with or without demographics. Binary classification with (a) and without (b) information on age, gender, and high myopia in the test set. Ternary classification with (c) and without (d) information on age, gender, and high myopia in the test set
has not yet been made in which kind of images are suitable for DLS. How to balance the pros and cons of retaining enough information and minimizing noise in images remains to be determined. Although demographics seemed to play a lesser role in our DLS for glaucoma diagnosis and classification, the linkage between images and eyes had a meaningful impact on performance. In glaucoma screening, eye-based analysis improved all the outcome measures compared with image-based analysis, except for precision and specificity.
This phenomenon may be attributed to an increase in false positives, since glaucoma is diagnosed when any one of the images from the same eye is predicted to be positive. By contrast, in our ternary classification the final diagnosis was predicted by averaging the probabilities of all images of an eye. This strategy improved all outcome measures in ternary classification.
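The averaging strategy for ternary classification can be sketched as follows, assuming each image yields one probability per class. The class order, function name, eye identifier, and probabilities are illustrative assumptions.

```python
from collections import defaultdict

CLASSES = ("non-GON", "HTG", "NTG")  # assumed class order for this sketch

def eye_level_mean_probability(image_probs):
    """Average per-image class probabilities within each eye, then take the argmax.

    image_probs: iterable of (eye_id, [p_nonGON, p_HTG, p_NTG]) pairs.
    Returns {eye_id: (predicted_class, mean_probabilities)}.
    """
    sums = {}
    counts = defaultdict(int)
    for eye_id, probs in image_probs:
        if eye_id not in sums:
            sums[eye_id] = [0.0] * len(CLASSES)
        for k, p in enumerate(probs):
            sums[eye_id][k] += p
        counts[eye_id] += 1
    result = {}
    for eye_id, total in sums.items():
        mean = [s / counts[eye_id] for s in total]
        result[eye_id] = (CLASSES[mean.index(max(mean))], mean)
    return result

# Hypothetical eye with three images; the first image alone would be called non-GON.
probs = [
    ("A-OD", [0.5, 0.3, 0.2]),
    ("A-OD", [0.2, 0.6, 0.2]),
    ("A-OD", [0.1, 0.7, 0.2]),
]
eye_result = eye_level_mean_probability(probs)
```

Unlike the any-positive rule, averaging lets confident images outweigh a single uncertain one, so an isolated misclassified image does not determine the eye-level label.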
According to the confusion matrices (Fig. 2e-h), the DLS for ternary classification is still effective in distinguishing non-GON from GON, but less effective in distinguishing NTG from non-GON and HTG. This result may be attributed to several potential reasons, including the small number of NTG eyes and the nature of HTG and NTG, whose fundus images may show no remarkable morphological difference. By providing the linkage between eyes and images, performance can be improved in all outcome metrics. Moreover, performance can also be improved by using macro or micro averages in ternary classification. In further interpreting the confusion matrices, specificity was not significantly improved by adding demographics in either binary or ternary classification; likewise, this additional information did not remarkably improve the classification of glaucoma. Similar to our multiple classification, one study aimed to distinguish GON, stratified by individual mean deviation in the VF report, from healthy people using stereo fundus images. Their results showed AUCs from 0.89 to 0.97 under different conditions [28]. Interestingly, a comparison of IOP prediction between a multivariate linear regression model (MLM) with 35 systemic variables and a DLS with colour fundus images showed that the former had a better predictive value [29]. These results may suggest that demographics are better suited to predicting physiological parameters than to supporting image-based glaucoma screening.
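The difference between macro and micro averaging of the F1 score can be made concrete with a small sketch over a 3 × 3 confusion matrix. The function name is hypothetical and the counts are toy values chosen so that the third class is the hardest; they are not the study's confusion matrix.

```python
def f1_from_confusion(cm):
    """Per-class, macro-averaged, and micro-averaged F1 from a square confusion matrix.

    cm[i][j] = number of samples whose true class is i and predicted class is j.
    """
    n = len(cm)
    per_class = []
    tp_sum = fp_sum = fn_sum = 0
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # true class k, predicted otherwise
        fp = sum(cm[i][k] for i in range(n)) - tp   # predicted k, true class otherwise
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    macro = sum(per_class) / n                       # unweighted mean over classes
    micro_denom = 2 * tp_sum + fp_sum + fn_sum
    micro = 2 * tp_sum / micro_denom if micro_denom else 0.0
    return per_class, macro, micro

# Toy counts only (rows: true class, columns: predicted class).
toy_cm = [
    [45, 2, 1],
    [2, 10, 3],
    [2, 3, 2],
]
per_class_f1, macro_f1, micro_f1 = f1_from_confusion(toy_cm)
```

For single-label multi-class problems the micro-averaged F1 equals overall accuracy, whereas the macro average weights every class equally, so a small, poorly separated class lowers the macro value more than the micro value.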
Heatmaps were used to visualize which regions the DLS attended to. In binary classification, the weighted area lay in the peripheral retina in non-GON eyes and at the optic nerve in eyes with GON, suggesting a different but efficient way for the DLS to identify glaucoma quickly. Even when DH or an RNFL defect was present in the image, GON misinterpreted as a healthy optic disc may have resulted from artifacts or coexisting retinal lesions, such as macular pucker, myopic tessellated fundus, and large peripapillary atrophy (PPA), which suggests that the DLS focused first on abnormal retinal presentations. Some glaucomatous images from the same eye were initially misinterpreted as non-GON; however, sensitivity improved when the linkage between images and the eye was built. Images of a healthy optic disc that were misinterpreted as GON may have been influenced by tortuous vessels, underexposed areas, and PPA in the fundus images. The heatmap in ternary classification still showed a weighted area at the optic disc in the HTG group. HTG images misinterpreted as NTG presented a weighted area over a vascular bifurcation, arteriovenous nicking, or the nasal retina. As in the heatmaps of binary classification, lesions of the retina or optic disc such as disc hemorrhage could mislead the DLS into a wrong classification, even when a remarkable RNFL defect existed at the same time. Unlike the heatmaps of binary classification, a weighted area was present in the region supero-temporal to the healthy optic nerve in ternary classification. This phenomenon suggests that the DLS used different strategies to analyze data in binary and ternary classification.
The limitations of our study include limited case numbers, lack of remarkable retinal or optic disc lesions other than glaucoma, a single ethnic background, and exclusion of pre-perimetric glaucoma. The small size of the training and validation sets is a drawback in machine learning, which may affect the accuracy of glaucoma screening and lead to overfitting [4]. However, dropout, data augmentation, and analysis at eye level were used to achieve applicable accuracy and AUC in glaucoma screening and classification. Glaucoma screening in the presence of combined ocular diseases and detection of pre-perimetric glaucoma are still major challenges for DLS.
Fig. 3 The outcome measures in image-based analysis of red-free or colour fundus images. Test metrics calculated from red-free fundus images and colour fundus images were compared by paired t-test. In binary classification, red-free fundus images achieved numerically better performance, which was not statistically significant (a). Colour fundus images achieved better performance in ternary classification, in which statistically significant differences were observed (b). n.s. = not statistically significant
Fig. 4 Binary classification presented by heatmap to show weighted area of deep learning system. Non-glaucomatous fundus (a) showed a weighted area peripherally, outside optic disc (b). Weighted area presented temporal to optic disc in normal-tension glaucoma (c and d) and high-tension glaucoma (e and f) in binary classification. Red-free fundal picture and its associated heatmap (g and h) in our study

Fig. 5
Ternary classification presented by heatmap to show weighted area of deep learning system. Non-glaucomatous fundus (a) showed weighted area supero-temporal to optic disc (b). Weighted area presented at optic disc in high-tension glaucoma (c and d) in ternary classification. Misclassification of normal-tension glaucoma into high-tension glaucoma (e), showing weighted area nasal to optic disc (f) in the left eye. High-tension glaucoma was misclassified into normal-tension glaucoma (g), presenting a weighted area inferior to optic disc (h) in the right eye